feat(server): implement multi-tenancy provider with file and HTTP backends#979
feat(server): implement multi-tenancy provider with file and HTTP backends#979Pangjiping wants to merge 6 commits into
Conversation
…kends Wire TenantProvider interface through full request chain: - TenantProvider Protocol + FileTenantProvider + HTTPTenantProvider - Auth middleware multi-tenant/single-tenant mode branching - ContextVar-based tenant propagation (tenants/context.py) - Dynamic namespace resolution in KubernetesSandboxService - [tenants] config section in server.toml (provider=file|http) - Startup guards: docker+tenants fatal, api_key+tenants mutual exclusion - HTTP provider: per-key TTL cache, sync fetch, max_stale, 401 eviction - 27 e2e tests covering both providers end-to-end - Update OSEP-0012 with formal interface spec and HTTP contract Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- New script scripts/python-k8s-e2e-multi-tenant.sh with file/http variants - Add multi-tenant-file and multi-tenant-http matrix entries to nightly workflow - Helm chart: support tenantsToml value for mounting tenants.toml ConfigMap - HTTP variant: mock Python provider on host, accessible via Kind bridge IP - Both variants verify auth (valid key→200, invalid→401, missing→401) then run full SDK kubernetes-mini test suite with tenant key Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Use Kind control-plane container's gateway IP as primary detection, with fallbacks. Add pre-deploy reachability check for the mock endpoint. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pangjiping
left a comment
There was a problem hiding this comment.
Solid architecture — TenantProvider protocol, ContextVar propagation, namespace-scoped isolation all look correct. One critical issue (sync HTTP blocking the event loop) should be fixed before merge; the rest are improvements.
- Fix sync HTTP blocking event loop: use asyncio.to_thread() in auth middleware - Add per-key singleflight pattern to prevent thundering herd on TTL expiry - Log warning at startup when HTTP tenant endpoint is not HTTPS - Rename _resolve_tenants_path to resolve_tenants_path (public API) - Use tuple for TenantEntry.api_keys for true frozen immutability - Extract validate_tenant_config() so startup guards are testable - Add [tenants] section to configuration.md, README, and example configs Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d02b0be6c3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if self._is_proxy_path(request.url.path): | ||
| return await call_next(request) |
There was a problem hiding this comment.
Authenticate proxy requests before resolving tenant namespaces
In multi-tenant Kubernetes deployments, /sandboxes/{id}/proxy/... requests return before the tenant provider is consulted, so current_tenant is never set even when the client sends a valid tenant API key. The proxy handler calls lifecycle.sandbox_service.get_endpoint(...), and the new Kubernetes service resolves that through _resolve_namespace(), which falls back to the default namespace when no tenant context exists; sandboxes created under a tenant namespace will therefore be reported missing or proxied from the wrong namespace. The proxy exemption needs to be conditional or still establish tenant context in multi-tenant mode.
Useful? React with 👍 / 👎.
| - name: tenants | ||
| mountPath: /etc/opensandbox/tenants.toml | ||
| subPath: tenants.toml | ||
| readOnly: true |
There was a problem hiding this comment.
Mount tenants ConfigMap without subPath for reloads
The file provider is documented and implemented as a hot-reloading tenants.toml, but the Helm chart mounts the ConfigMap key via subPath; Kubernetes documents that ConfigMap mounts using subPath do not receive updates when the ConfigMap changes. As a result, updating tenantsToml in a Helm release will not change /etc/opensandbox/tenants.toml in the running server pod, so key rotation/add/remove will not be observed until the pod restarts.
Useful? React with 👍 / 👎.
| workload = self.workload_provider.get_workload( | ||
| sandbox_id=sandbox_id, | ||
| namespace=self.namespace, | ||
| namespace=self._resolve_namespace(), |
There was a problem hiding this comment.
Propagate tenant context into renew workers
When renew_intent is enabled in multi-tenant Kubernetes mode, renew work is processed by the background RenewIntentConsumer tasks, not in the authenticated request context. Those workers only carry sandbox_id, so this new _resolve_namespace() call falls back to the configured default namespace and cannot find or renew sandboxes that were created in a tenant namespace; the renew work item needs to include the tenant namespace or otherwise set tenant context before calling the service.
Useful? React with 👍 / 👎.
| if not is_leader: | ||
| event.wait(timeout=self._config.timeout_seconds) | ||
| with self._lock: | ||
| cached = self._cache.get(api_key) | ||
| if cached: | ||
| return cached.tenant | ||
| return None |
There was a problem hiding this comment.
Preserve HTTP lookup errors for waiters
For concurrent cold lookups of the same valid API key, only the leader performs the HTTP request while the other requests wait here; if the leader times out, gets a 5xx, or simply takes longer than this timeout, the waiters see no cache entry and return None, which the auth middleware converts to 401 INVALID_API_KEY. That makes provider outages or slow lookups look like invalid credentials for some callers instead of returning the same 503/tenant result as the leader, so the singleflight path should propagate the leader's result or error to waiters.
Useful? React with 👍 / 👎.
| _tenant_provider = FileTenantProvider(_tenants_path) | ||
| _tenant_provider.start() |
There was a problem hiding this comment.
Validate file tenants before starting
The multi-tenancy design added in oseps/0012-multi-tenancy.md says startup must validate that all tenant namespaces exist and are accessible, but file-provider startup only parses the TOML and starts the watcher. With a misspelled or unauthorized namespace the server comes up healthy and every request for that tenant fails later during workload operations, so the file-provider path should check the loaded namespaces against Kubernetes before accepting the config.
Useful? React with 👍 / 👎.
| def lookup(self, api_key: str) -> Optional[TenantEntry]: | ||
| with self._lock: | ||
| return self._lookup.get(api_key) |
There was a problem hiding this comment.
Use constant-time checks for file tenant keys
In file-provider mode, every authenticated request resolves the presented API key through a normal Python dict lookup. The multi-tenancy spec added in this commit explicitly requires constant-time API key comparison, but dict lookup/equality is key-dependent and does not provide that property, so this backend should compare candidate keys with a constant-time primitive (or store and compare keyed digests) rather than indexing directly by the secret.
Useful? React with 👍 / 👎.
Summary
Wire TenantProvider interface through full request chain:
Testing
Breaking Changes
Checklist